Week 10.1 - Agents: What They Are and What's Actually New in 2026

🎯 What We'll Cover

For the last two years, “AI agent” has been one of the most over-used and under-defined phrases in the field. Vendors apply it to everything from a chatbot that can call a weather API to a system that runs unsupervised for hours rewriting a codebase. If we are going to think clearly about what agents mean for research, we need a working definition that survives the marketing.

This sub-lesson gives you that definition — an agent is a model wrapped in a harness of tools, memory, loops, and permissions — and then makes one central observation that frames the whole week: since 2024, the harness has become the product. The base model matters less than it used to; how it is scaffolded matters more. We will look at the May 2026 agent landscape, the surprising trend of coding agents being repurposed for general knowledge work, and what all of this actually means for a working researcher.

A word of calibration, carried over from Week 9: every concrete claim in this sub-lesson is time-stamped. The agent landscape moves faster than almost anything else in AI right now. What is true in May 2026 will be partly false by November. Treat the specifics as a snapshot and the framing as the durable part.

🤖 What Is an Agent? A Working Definition

Strip away the marketing and an agent is a language model placed inside a loop that lets it act, not just respond. Five components turn a model into an agent. The first is the model itself; the other four, taken together, are what this week calls the harness.

A model — the same kind of transformer doing next-token prediction we met in Week 2. Nothing about the underlying architecture has changed; the model is still predicting tokens. What is new is what those tokens are allowed to do.

🧰 The Harness — everything built around the model

The model is the engine; the harness is the chassis, wheels, steering, and brakes built around it. A model on its own answers questions. A model in a harness gets things done. These four components, together with the prompts and context management that coordinate them, are the harness:

Tools — functions the model can call by emitting a structured request in its output stream: run code, search the web, read a file, query a database, send an email. The model does not execute these; the harness does, and feeds the result back in.
A loop — the harness runs the model, executes any tool call it requests, appends the result to the context, and runs the model again. This repeats until the model decides the task is done. A chatbot answers once; an agent iterates.
Memory — some way of carrying state across many loop iterations (and sometimes across sessions), because a useful task can span dozens or hundreds of steps that overflow a single context window.
Permissions — the boundary of what the agent is allowed to do without asking. Read-only? Allowed to write files? Allowed to spend money? This is the single most important safety dial, and the one researchers most often leave on the wrong setting.
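
To make the four harness components concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product is built: the "model" is a scripted stand-in function rather than a real language model, the two tools only return strings and never touch the filesystem, and the names Harness, Tool, READ_ONLY and WRITE are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

READ_ONLY, WRITE = 0, 1  # hypothetical permission levels, lowest to highest


@dataclass
class Tool:
    name: str
    func: Callable[[str], str]
    required_permission: int


@dataclass
class Harness:
    model: Callable[[list], dict]   # stand-in for a real language-model call
    tools: dict                     # tool name -> Tool
    permission: int = READ_ONLY     # the permissions dial
    memory: list = field(default_factory=list)  # state carried across iterations
    max_steps: int = 10

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):              # the loop
            action = self.model(self.memory)         # the model chooses the next action
            if action["type"] == "finish":
                return action["answer"]
            tool = self.tools[action["tool"]]
            if tool.required_permission > self.permission:
                result = "DENIED: needs human approval"
            else:
                result = tool.func(action["arg"])    # the harness executes, not the model
            self.memory.append(f"{tool.name}({action['arg']}) -> {result}")
        return "stopped: step budget exhausted"


def scripted_model(memory: list) -> dict:
    """Toy policy in place of an LLM: read a file, try to write one, then finish."""
    if len(memory) == 1:
        return {"type": "tool", "tool": "read_file", "arg": "notes.txt"}
    if len(memory) == 2:
        return {"type": "tool", "tool": "write_file", "arg": "summary.txt"}
    return {"type": "finish", "answer": memory[-1]}


TOOLS = {
    "read_file": Tool("read_file", lambda path: f"<contents of {path}>", READ_ONLY),
    "write_file": Tool("write_file", lambda path: f"wrote {path}", WRITE),
}

cautious = Harness(scripted_model, TOOLS, permission=READ_ONLY)
trusted = Harness(scripted_model, TOOLS, permission=WRITE)
print(cautious.run("summarise my notes"))  # the write attempt is denied
print(trusted.run("summarise my notes"))   # the write attempt goes through
```

The same scripted model produces different outcomes in the two runs. Only the permission setting differs, which is the sense in which the dial belongs to the harness rather than to the model.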

Keep that word — harness — in mind. The central claim of the next section is that in 2026 the harness, not the engine, increasingly decides how well the whole thing drives.

💡 Drop the word “autonomy”

Popular coverage frames agents as a binary: either a tool “is autonomous” or it is not. This is unhelpful. Autonomy is a spectrum set by the permissions dial, not a property of the model. The same model is a cautious assistant when it must confirm every action, and an unsupervised delegate when it can act freely for an hour. When you read that a system “is an autonomous agent”, the useful question is never “is it autonomous?” but “how much is it allowed to do, over how long, before a human looks?”

This definition lets us draw two clean distinctions. A chatbot with browsing is not yet an agent in the full sense — it makes one tool call and answers. A 2023-style “chain” (a fixed, pre-scripted sequence of model calls) is not an agent either, because the model does not decide what to do next; the developer hard-coded the steps. An agent chooses its own next action inside the loop. That difference — the model, not the developer, steering the loop — is what changed between 2023 and 2026, and it is where both the new capability and the new failure modes come from.
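
The chain-versus-agent distinction can be sketched in a few lines of Python. This is a toy under stated assumptions: the "steps" are plain string operations, and the choose function is a hand-written rule standing in for the model's decision. The names run_chain, run_agent and choose are hypothetical.

```python
from typing import Callable


def run_chain(steps: list, text: str) -> str:
    """2023-style chain: the developer fixed the order of the steps in advance."""
    for step in steps:
        text = step(text)
    return text


def run_agent(choose: Callable[[str], str], actions: dict, text: str,
              max_steps: int = 5) -> tuple:
    """Agent loop: `choose` (standing in for the model) picks each next action."""
    trace = []
    for _ in range(max_steps):
        name = choose(text)
        if name == "done":
            break
        text = actions[name](text)
        trace.append(name)
    return text, trace


ACTIONS = {"strip": str.strip, "lower": str.lower}


def choose(text: str) -> str:
    """Hand-written stand-in for the model's decision about what to do next."""
    if text != text.strip():
        return "strip"
    if text != text.lower():
        return "lower"
    return "done"


print(run_chain([str.strip, str.lower], "  Hello "))  # always runs both steps
print(run_agent(choose, ACTIONS, "  Hello "))         # chooses strip, then lower
print(run_agent(choose, ACTIONS, "hello"))            # chooses to stop immediately
```

The chain runs every step regardless of the input. The agent loop inspects the current state and decides what to do, including when to stop, and that is also where a wrong decision can enter.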

🔧 What's New Since 2024: The Harness Is the Product

Here is the single most important shift to internalise. In 2023, agent performance was dominated by the base model: a better model gave you a better agent, full stop. By May 2026, that is no longer true. Agent performance is a joint property of the model, the harness around it, and the strategy for packing the right context into the model at each step. The harness — the prompts, the tool definitions, the context management, the evaluation loop, the model-specific tuning — has become the thing that companies actually build and sell.

This is not just a claim from practitioners. It is now the subject of formal research. The clearest statement comes from a March 2026 Stanford preprint, Meta-Harness: End-to-End Optimization of Model Harnesses (Lee et al.), whose opening line is almost a thesis statement for this whole week:

📑 From the Meta-Harness paper (Lee et al., Stanford, March 2026)

“The performance of large language model systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model.”

The paper builds a system that automatically searches over harness code while keeping the model fixed, and improves on a state-of-the-art context-management baseline by 7.7 points while using a quarter as many context tokens. The point for us is not the system itself but the demonstration: a measurable chunk of an agent's performance lives in the harness, independent of the model, and is large enough to optimise on its own.

The companies building these systems show the same effect in detail. In a February 2026 engineering write-up, the LangChain team took a single coding model (GPT-5.2-Codex) and, without changing the model, improved their agent's score on Terminal-Bench 2.0 by 13.7 points — from 52.8% to 66.5%, moving it from outside the top 30 to the top 5. Every point of that gain came from the harness:

🔧 A worked example: 13.7 points from the harness alone

LangChain's recipe for the jump from 52.8% to 66.5% on Terminal-Bench 2.0 (same model throughout) was entirely harness engineering:

A better system prompt — explicit planning, build, verify, and fix phases, with emphasis on edge-case testing and checking output against the task specification.
Middleware in the loop — a pre-exit checklist that forces verification before the agent declares itself done, an onboarding step that maps the working environment, and a loop-detector that interrupts the agent when it gets stuck repeating an edit.
Context management — mapping the directory structure and available tools up front, and warning the agent about its remaining time budget.
A “reasoning sandwich” — allocating more reasoning effort to the planning and verification phases than to the execution in between.

None of this touched the model's weights. It is a concrete demonstration of the thesis: a large slice of agent performance is engineering of the scaffolding, and it is exactly the kind of thing a tool vendor can do that you, choosing between tools, cannot see from the model name alone.
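
To give a feel for what "middleware in the loop" can look like, here is a hedged sketch of three of the ideas above: a loop detector, a pre-exit checklist, and a per-phase reasoning-effort table. It is not LangChain's code. The class and function names, the repeat threshold of three, and the effort labels are all assumptions made for illustration.

```python
from collections import Counter


class LoopDetector:
    """Flags when the same action (for example the same file edit) keeps recurring."""

    def __init__(self, max_repeats: int = 3):  # threshold is an assumed value
        self.max_repeats = max_repeats
        self.counts = Counter()

    def observe(self, action: str) -> bool:
        """Record an action; return True once it has been seen max_repeats times."""
        self.counts[action] += 1
        return self.counts[action] >= self.max_repeats


def pre_exit_check(checks: dict) -> list:
    """Return the names of checks that have not passed.

    A harness would let the agent declare itself done only if this list is empty.
    """
    return [name for name, passed in checks.items() if not passed]


# A "reasoning sandwich": more effort at the edges, less in the middle.
# The phase names and effort labels are hypothetical.
REASONING_EFFORT = {"plan": "high", "build": "medium", "verify": "high"}

detector = LoopDetector()
for _ in range(3):
    stuck = detector.observe("edit main.py")
print(stuck)  # True after the third identical edit
print(pre_exit_check({"tests_run": True, "output_matches_spec": False}))
```

None of these pieces needs access to the model's weights. Each one sits in the harness and intervenes between one model call and the next.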

And it generalises across models. In a May 2026 follow-up the same team reported that, on a different agentic benchmark (tau2-bench), prompts and middleware alone moved scores by 10 to 20 points — a roughly 20% lift on one coding model and 10% on another — and that the same harness techniques apply when driving open-weight models such as Kimi, Qwen, and DeepSeek (which we return to in 10.5). The lesson holds regardless of whose model you use: the harness is doing a large and measurable share of the work.

⚠️ What this means when you read a benchmark

If the harness can swing a score by ten points or more, then a leaderboard number attached to a bare model name is close to meaningless without knowing the scaffolding that produced it. This is not hypothetical: Terminal-Bench 2.0 (the benchmark itself, arXiv:2601.11868) is explicitly run through a named harness, and different harnesses on the same model produce different scores. When a tool claims “state-of-the-art on benchmark X”, the honest question is: which harness? A strong model, badly scaffolded, will underperform a weaker model in a better harness. This is the agentic version of the Week 9 lesson that benchmark numbers need their context to mean anything.
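
One practical habit is to record the harness alongside every score you note down. The sketch below uses the two Terminal-Bench 2.0 figures quoted earlier (52.8% and 66.5%, same model). The harness labels "baseline harness" and "tuned harness" are placeholders of ours, and the BenchmarkResult record and helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    benchmark: str
    model: str
    harness: str   # a score without this field is hard to interpret
    score: float   # percent of tasks solved


def harness_effect(a: BenchmarkResult, b: BenchmarkResult) -> float:
    """Difference in percentage points, valid only if model and benchmark match."""
    if (a.benchmark, a.model) != (b.benchmark, b.model):
        raise ValueError("model or benchmark differs; harness effect is not isolated")
    return round(b.score - a.score, 1)


def relative_gain(a: BenchmarkResult, b: BenchmarkResult) -> float:
    """The same change expressed as a percentage of the starting score."""
    return round((b.score - a.score) / a.score * 100, 1)


before = BenchmarkResult("Terminal-Bench 2.0", "GPT-5.2-Codex", "baseline harness", 52.8)
after = BenchmarkResult("Terminal-Bench 2.0", "GPT-5.2-Codex", "tuned harness", 66.5)

print(harness_effect(before, after))  # 13.7 percentage points
print(relative_gain(before, after))   # 25.9 percent relative to the starting score
```

The two functions also show why it matters whether a reported gain is in points or in percent: the same change reads as 13.7 in one unit and 25.9 in the other.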

There is a research-tooling consequence too. The agent-infrastructure companies are now shipping harnesses as products: at LangChain's Interrupt conference (13–14 May 2026), the company released a cluster of harness-building tools — a hosted runtime for “deep agents”, a faster trace database, sandboxes, a context-management hub, and a gateway that enforces spend limits and redacts personal data before requests leave your environment. The takeaway for a researcher is not that you need any of these, but that the centre of gravity in the field has moved from “which model” to “which harness”. When you choose an agent tool, you are choosing a harness as much as a model.
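
The gateway idea, a checkpoint that enforces a spend limit and redacts personal data before a request leaves your environment, can be sketched in a few lines. This is a toy under stated assumptions and not any vendor's implementation: the Gateway class is hypothetical, the cost estimate is supplied by the caller, and only email addresses are redacted, whereas real personal-data detection covers far more.

```python
import re

# A simple email pattern; a deliberate simplification of personal-data detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Gateway:
    """Sits between the agent and the outside world (hypothetical sketch)."""

    def __init__(self, spend_limit_usd: float):
        self.spend_limit_usd = spend_limit_usd
        self.spent_usd = 0.0

    def prepare(self, prompt: str, estimated_cost_usd: float) -> str:
        """Block the request if it would exceed the budget; otherwise redact it."""
        if self.spent_usd + estimated_cost_usd > self.spend_limit_usd:
            raise RuntimeError("spend limit reached; request blocked")
        self.spent_usd += estimated_cost_usd
        return EMAIL.sub("[REDACTED_EMAIL]", prompt)


gateway = Gateway(spend_limit_usd=1.00)
print(gateway.prepare("Contact thandi@example.org about the survey", 0.60))
```

A second request costing 0.60 through the same gateway would raise an error, because the running total would pass the 1.00 limit. The design point is that the limit is enforced outside the model, so it holds even if the agent misjudges its own spending.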

🗺️ The 2026 Agent Landscape in One Picture

As of May 2026, the agent ecosystem sorts into roughly five families. The boundaries are blurring fast (see the next section), but this map is a useful starting orientation. We will return to specific tools, and to the honest free-versus-paid picture, in Sub-Lessons 10.3 and 10.5.

Family	What it does	Representative tools (May 2026)
Coding agents	Read, write, run, and debug code across a whole project, often unsupervised for long stretches.	Claude Code, OpenAI Codex, Cursor, Cline, opencode
Computer-use agents	Drive a graphical interface — click, type, scroll — the way a person would, for tasks with no API.	Anthropic Computer Use, Codex computer-use mode, the Operator lineage
Research agents	Plan a multi-step search, gather and cross-check sources, and synthesise a cited report. “Deep Research” modes.	Deep Research modes across the major chat assistants (covered in 10.5)
General agents	Attempt arbitrary multi-step tasks across tools and the web from a single instruction.	Manus, ChatGPT Agent
Creative agents	Drive creative software (3D, audio, design) through tool connectors.	Claude with creative connectors (Blender, Ableton, and others)

A concrete, datable example of how fast the creative corner is moving: on 28 April 2026 Anthropic shipped nine creative tool connectors for Claude — including Blender, Ableton, Adobe and Autodesk integrations — and, notably, made them available on the free plan. We will treat “what is actually free” as a first-class question throughout this week, because for many students at UCT it is the question that decides whether a tool is usable at all.

🖥️ A threshold moment: computer-use at human speed

The computer-use family crossed a noticeable threshold in April 2026. Ari Weinstein of OpenAI, posting about the Codex computer-use mode, wrote on 16 April: “This is the first time I've ever seen an LLM operate a GUI as fast as a person, and it's surreal.” Two weeks later he reported that a Codex app update had made one computer-use task run 42% faster again.

Two honest caveats. First, these are first-party claims — the person shipping the product describing his own product — not independent measurements, so treat the 42% as a vendor figure. Second, “as fast as a person” is about speed, not reliability; a fast agent that takes a wrong action quickly is not obviously progress (we develop exactly this distinction in 10.2). But the qualitative marker is real and worth recording: as of mid-2026, an AI driving a graphical interface by clicking and typing is no longer visibly slow. The bottleneck has moved from speed to judgement.

📝 A note on naming

Throughout this course we refer to model families — “Claude (family)”, “GPT (family)”, “Gemini (family)” — rather than chasing version numbers, because the versions change monthly. Agent products are different: “Claude Code”, “Codex”, “Cursor”, “Manus” are tool names, and we keep them, the same way we kept “Whisper large-v3” in Week 8. When you see a specific version number attached to a benchmark figure, treat it as a dated citation, not a recommendation.

🚀 Agents Are Breaking Containment

The most interesting throughline of April–May 2026 is that the families above are bleeding into one another. Tools built as coding agents are being pointed at general knowledge work, because the thing that makes a good coding agent — a robust loop, file access, the ability to run for a long time without losing the thread — turns out to be exactly what a good general agent needs too.

In the space of a few weeks, the major coding-agent products launched explicit “for work” or “for non-code tasks” surfaces. OpenAI's 16 April 2026 Codex update — which the company itself titled “Codex for (almost) everything” — explicitly repositioned a coding tool for knowledge work: operating your computer, working across apps like Slack, Gmail and Notion, remembering preferences, and scheduling its own future work. Anthropic pushed Claude into knowledge-work and creative settings over the same period. (A reported general-work product circulating under the name “Orbit” had not, as of mid-May 2026, been officially launched — treat it as a rumour until it ships.) The pattern, though, is real and on the record: the coding harness has become a general-purpose work harness.

🍽️ Why this matters for a researcher

You do not need to write code to be affected by this. A coding agent that can read a folder of files, run a script, and edit a document is, functionally, a general research assistant that happens to have been built by people optimising for software tasks. Increasingly, the most capable “do this multi-step thing for me” tool available to a researcher is a coding agent wearing a different label.

The flip side is that the caution travels with the capability. Everything Week 7 said about verifying AI-generated code — that plausible output can be silently wrong — now applies to a delegate that runs for an hour and touches many files before you see the result.

🧮 What This Means for Research Workflows

It is tempting to read all of this as “AI is about to do research for us”. That is the wrong framing, and the rest of this week is largely about why. The accurate framing is narrower and more useful: research workflows now routinely include long-running, tool-using delegates. The question is no longer “can the model answer my question?” but “should I hand this multi-step task to an agent, and if so, how do I stay in control of what it does?”

That shift raises the stakes on the Week 7 lesson about “vibe coding”, that is, accepting AI-generated work without understanding it. When a chatbot writes a function, you can read the function. When an agent runs unsupervised for an hour, you inherit whatever it produced, and the surface area for silent error is far larger. Through 2025–26, a recurring story on developer social media was the developer who inherits a vibe-coded codebase and finds that the cleanup means deleting millions of lines of accumulated machine-generated cruft. The figures in any single anecdote are unverifiable, but the structural point is not: an unsupervised agent can accumulate a great deal of plausible-looking output, and a great deal of it can be wrong, before any human reviews it.

The frame for the week: the harness, not the model

Week 9 gave you a frame for capability claims: ask which model, which date, has anyone retested. Week 10 adds the layer above it. For an agent, the model is only one of five components, and often not the one that determines whether the thing works. So the agent-era version of the question is: which model, in which harness, with which permissions, verified how?

Hold onto that. In 10.2 we apply the Week 9 patched / reduced-but-persistent / structural failure taxonomy to agents specifically, because agents fail in genuinely new ways that single-turn chatbots did not.

📖 Sources & Further Reading

Primary sources for the claims above:

Lee, Y., et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052 — the “harness is the product” thesis, with measured results.
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces. arXiv:2601.11868 · tbench.ai leaderboard — the harness-run benchmark referenced above.
LangChain, “Improving Deep Agents with Harness Engineering” (17 February 2026) — the worked 52.8% → 66.5% example and exactly which harness changes produced it.
LangChain, “Deep Agents 0.6” (13 May 2026) — the tau2-bench 10–20 point swings across models, and harness profiles for open-weight models (Kimi, Qwen, DeepSeek).
OpenAI, “Codex for (almost) everything” (16 April 2026) — coding agent repositioned for knowledge work.
LangChain, “Everything we shipped at Interrupt” (13–14 May 2026) — harnesses sold as products.
Anthropic, “Claude for Creative Work” (28 April 2026) — the nine free creative connectors.
Ari Weinstein (OpenAI) on X (30 April 2026; quoting his 16 April post) — the computer-use “as fast as a person” observation and the 42% speed-up. A first-party claim, cited as such.

For day-by-day tracking of these fast-moving developments, the AINews / Latent Space daily briefings are a useful aggregator — but note that they themselves aggregate company announcements and social-media posts, so trace any specific number back to its origin (as we have done here) before citing it. For a pragmatic practitioner's voice — including healthy scepticism about how much of this you actually need — Simon Willison's ongoing writing is consistently worth reading.

👉 What Comes Next

Sub-Lesson 10.2 — Failure Modes for Multi-Step / Long-Horizon Tasks. Now that we have a working definition and the “harness is the product” frame, we turn to where agents break. We take the patched / reduced-but-persistent / structural taxonomy from Week 9.2 and apply it directly to agents — because running a model in a loop for many steps introduces failure modes that no single-turn chatbot ever had. The headline finding, from a February 2026 Princeton study: across 18 months and 14 models, agent accuracy improved substantially while agent reliability barely moved. Accuracy is not reliability, and for a researcher the gap between them is where the danger lives.